risk curve
Learning quadratic neural networks in high dimensions: SGD dynamics and scaling laws
Ben Arous, Gérard, Erdogdu, Murat A., Vural, N. Mert, Wu, Denny
We study the optimization and sample complexity of gradient-based training of a two-layer neural network with quadratic activation function in the high-dimensional regime, where the data is generated as $y \propto \sum_{j=1}^{r}\lambda_j \sigma\left(\langle \boldsymbol{\theta}_j, \boldsymbol{x}\rangle\right)$, $\boldsymbol{x} \sim N(0,\boldsymbol{I}_d)$, where $\sigma$ is the second Hermite polynomial and $\{\boldsymbol{\theta}_j\}_{j=1}^{r} \subset \mathbb{R}^d$ are orthonormal signal directions. We consider the extensive-width regime $r \asymp d^{\beta}$ for $\beta \in [0, 1)$, and assume a power-law decay on the (non-negative) second-layer coefficients, $\lambda_j \asymp j^{-\alpha}$ for $\alpha \geq 0$. We present a sharp analysis of the SGD dynamics in the feature learning regime, for both the population limit and the finite-sample (online) discretization, and derive scaling laws for the prediction risk that highlight the power-law dependencies on the optimization time, sample size, and model width. Our analysis combines a precise characterization of the associated matrix Riccati differential equation with novel matrix monotonicity arguments to establish convergence guarantees for the infinite-dimensional effective dynamics.
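For concreteness, here is a minimal Python sketch of the data-generating process described in the abstract. This is not the authors' code; the dimensions, the decay exponent, and the normalization are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 256, 16, 1.0                                   # illustrative sizes, not the paper's

# Orthonormal signal directions: the first r rows of a random orthogonal matrix.
theta = np.linalg.qr(rng.standard_normal((d, d)))[0][:r]     # shape (r, d)
lam = np.arange(1, r + 1, dtype=float) ** (-alpha)           # lambda_j ~ j^{-alpha}

def he2(z):
    """Second (probabilist's) Hermite polynomial: He_2(z) = z^2 - 1."""
    return z ** 2 - 1.0

def sample(n):
    x = rng.standard_normal((n, d))                          # x ~ N(0, I_d)
    y = he2(x @ theta.T) @ lam                               # sum_j lambda_j He_2(<theta_j, x>)
    return x, y / np.linalg.norm(lam)                        # arbitrary normalization for "y proportional to"
```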
Uncertainty quantification for iterative algorithms in linear models with application to early stopping
This paper investigates the iterates $\hat{\boldsymbol{\beta}}^1,\dots,\hat{\boldsymbol{\beta}}^T$ obtained from iterative algorithms in high-dimensional linear regression problems, in the regime where the feature dimension $p$ is comparable to the sample size $n$, i.e., $p \asymp n$. The analysis and proposed estimators are applicable to Gradient Descent (GD), proximal GD, and their accelerated variants such as Fast Iterative Soft-Thresholding (FISTA). The paper proposes novel estimators of the generalization error of the iterate $\hat{\boldsymbol{\beta}}^t$ for any fixed iteration $t$ along the trajectory. These estimators are proved to be $\sqrt{n}$-consistent under Gaussian designs. Applications to early stopping are provided: when the generalization error of the iterates is a U-shaped function of the iteration $t$, the estimates allow one to select from the data an iteration $\hat{t}$ that achieves the smallest generalization error along the trajectory. Additionally, the paper provides a technique for developing debiasing corrections and valid confidence intervals for the components of the true coefficient vector from the iterate $\hat{\boldsymbol{\beta}}^t$ at any finite iteration $t$. Extensive simulations on synthetic data illustrate the theoretical results.
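A minimal sketch of the early-stopping selection rule $\hat{t} = \arg\min_t \widehat{\mathrm{err}}(t)$ on a synthetic linear model. The paper constructs $\sqrt{n}$-consistent estimators from the training data alone; as a stand-in, this sketch substitutes a simple held-out estimate of the generalization error, so only the selection mechanism is illustrated.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, T = 500, 400, 200
X = rng.standard_normal((n, p))
beta = np.zeros(p); beta[:20] = 1.0
y = X @ beta + rng.standard_normal(n)
X_val = rng.standard_normal((n, p)); y_val = X_val @ beta + rng.standard_normal(n)

b = np.zeros(p)
step = 1.0 / np.linalg.norm(X, 2) ** 2        # 1 / ||X||_op^2: a safe GD step size
iterates, est = [], []
for t in range(T):
    b = b + step * X.T @ (y - X @ b)          # gradient descent on 0.5 * ||y - Xb||^2
    iterates.append(b.copy())
    est.append(np.mean((y_val - X_val @ b) ** 2))   # held-out proxy for the paper's estimator

t_hat = int(np.argmin(est))                   # data-chosen stopping iteration
b_hat = iterates[t_hat]
print(t_hat, est[t_hat])
```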
Problem-Dependent Power of Quantum Neural Networks on Multi-Class Classification
Du, Yuxuan, Yang, Yibo, Tao, Dacheng, Hsieh, Min-Hsiu
Quantum neural networks (QNNs) have become an important tool for understanding the physical world, but their advantages and limitations are not fully understood. Some QNNs with specific encoding methods can be efficiently simulated by classical surrogates, while others with quantum memory may perform better than classical classifiers. Here we systematically investigate the problem-dependent power of quantum neural classifiers (QCs) on multi-class classification tasks. Through the analysis of expected risk, a measure that jointly weighs the training loss and the generalization error of a classifier, we identify two key findings: first, the training loss dominates the power rather than the generalization ability; second, QCs exhibit a U-shaped risk curve, in contrast to the double-descent risk curve of deep neural classifiers. We also reveal the intrinsic connection between optimal QCs and the Helstrom bound and the equiangular tight frame. Using these findings, we propose a method that uses loss dynamics to probe whether a QC may be more effective than a classical classifier on a particular learning task. Numerical results demonstrate the effectiveness of our approach in explaining the superiority of QCs over multilayer perceptrons on parity datasets and their limitations relative to convolutional neural networks on image datasets. Our work sheds light on the problem-dependent power of QNNs and offers a practical tool for evaluating their potential merit.
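The Helstrom bound mentioned above is concrete enough to state: for discriminating two quantum states $\rho_0, \rho_1$ with priors $p_0, p_1$, the optimal success probability is $\tfrac{1}{2}\left(1 + \|p_0\rho_0 - p_1\rho_1\|_1\right)$, with $\|\cdot\|_1$ the trace norm. A small self-contained numerical check on random density matrices (illustrative only; unrelated to the paper's experiments):

```python
import numpy as np

rng = np.random.default_rng(2)

def random_density_matrix(dim):
    a = rng.standard_normal((dim, dim)) + 1j * rng.standard_normal((dim, dim))
    rho = a @ a.conj().T
    return rho / np.trace(rho).real            # positive semidefinite, unit trace

def helstrom_bound(rho0, rho1, p0=0.5):
    gap = p0 * rho0 - (1 - p0) * rho1
    trace_norm = np.abs(np.linalg.eigvalsh(gap)).sum()   # trace norm of a Hermitian matrix
    return 0.5 * (1.0 + trace_norm)

rho0, rho1 = random_density_matrix(4), random_density_matrix(4)
print(helstrom_bound(rho0, rho1))               # always in [0.5, 1.0]
```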
Sketched Ridgeless Linear Regression: The Role of Downsampling
Chen, Xin, Zeng, Yicheng, Yang, Siyue, Sun, Qiang
Overparametrization often helps improve generalization performance. This paper presents a dual view of overparametrization, suggesting that downsampling may also help generalization. Focusing on the proportional regime $m \asymp n \asymp p$, where $m$ is the sketching size, $n$ the sample size, and $p$ the feature dimensionality, we investigate two out-of-sample prediction risks of the sketched ridgeless least squares estimator. Our findings challenge conventional beliefs by showing that downsampling does not always harm generalization but can actually improve it in certain cases. We identify the optimal sketching size that minimizes the out-of-sample prediction risks and demonstrate that the optimally sketched estimator exhibits more stable risk curves, eliminating the peaks of those of the full-sample estimator. To facilitate practical implementation, we propose an empirical procedure to determine the optimal sketching size. Finally, we extend our analysis to cover central limit theorems and misspecified models. Numerical studies strongly support our theory.
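An illustrative Monte Carlo of the setup: sketch the $n \times p$ design down to $m$ rows with a Gaussian sketch $S$, fit minimum-norm ("ridgeless") least squares on the sketched data, and track out-of-sample risk as $m$ varies. All sizes and the Gaussian sketch are illustrative choices, not the paper's settings; note the spike near $m = p$.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 600, 300
beta = rng.standard_normal(p) / np.sqrt(p)
X = rng.standard_normal((n, p))
y = X @ beta + 0.5 * rng.standard_normal(n)

def sketched_risk(m, n_test=2000):
    S = rng.standard_normal((m, n)) / np.sqrt(m)   # Gaussian sketching matrix
    b = np.linalg.pinv(S @ X) @ (S @ y)            # ridgeless (min-norm) fit on sketched data
    X_t = rng.standard_normal((n_test, p))
    return np.mean((X_t @ beta - X_t @ b) ** 2)    # out-of-sample prediction risk

for m in (150, 250, 310, 450, 600):                # m = 600 recovers the full-sample fit
    print(m, sketched_risk(m))
```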
Multiple Descent in the Multiple Random Feature Model
Meng, Xuran, Yao, Jianfeng, Cao, Yuan
Recent works have demonstrated a double descent phenomenon in over-parameterized learning; although this phenomenon has attracted considerable attention, it is not yet fully understood in theory. In this paper, we investigate the multiple descent phenomenon in a class of multi-component prediction models. We first consider a ''double random feature model'' (DRFM) concatenating two types of random features, and study the excess risk achieved by the DRFM in ridge regression. We calculate the precise limit of the excess risk in the high-dimensional framework where the training sample size, the dimension of the data, and the dimension of the random features tend to infinity proportionally. Based on this calculation, we further demonstrate theoretically that the risk curves of DRFMs can exhibit triple descent. We then provide a thorough experimental study to verify our theory. Finally, we extend our study to the ''multiple random feature model'' (MRFM), and show that MRFMs ensembling $K$ types of random features may exhibit $(K+1)$-fold descent. Our analysis shows that risk curves with a specific number of descents generally exist in learning multi-component prediction models.
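A toy version of the double random feature model: concatenate two random-feature maps (ReLU and cosine features here, an arbitrary pairing), fit ridge regression on the concatenation, and scan the feature widths to trace the risk curve numerically. This is a sketch of the model class, not the paper's experimental configuration.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, lam = 300, 100, 1e-4
beta = rng.standard_normal(d) / np.sqrt(d)
X = rng.standard_normal((n, d)); y = X @ beta + 0.3 * rng.standard_normal(n)
X_t = rng.standard_normal((2000, d)); y_t = X_t @ beta

def features(Z, W1, W2):
    # Two random-feature types, concatenated: ReLU and cosine nonlinearities.
    return np.hstack([np.maximum(Z @ W1 / np.sqrt(d), 0), np.cos(Z @ W2 / np.sqrt(d))])

for N in (50, 150, 290, 310, 600, 1200):           # total feature width is 2N
    W1, W2 = rng.standard_normal((d, N)), rng.standard_normal((d, N))
    F, F_t = features(X, W1, W2), features(X_t, W1, W2)
    b = np.linalg.solve(F.T @ F + lam * np.eye(2 * N), F.T @ y)   # ridge regression
    print(2 * N, np.mean((y_t - F_t @ b) ** 2))    # test risk along the width sweep
```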
Minimum $\ell_{1}$-norm interpolators: Precise asymptotics and multiple descent
At the core of statistical learning lies the problem of understanding the generalization performance (e.g., out-of-sample errors) of the learning algorithms in use. Conventional wisdom in statistics held that including too many covariates when training statistical models can hurt generalization (despite improving training accuracy), due to undesired over-fitting. This leads to the classical conclusion that proper regularization -- through either adding certain penalty functions to the loss function or algorithmic self-regularization -- is critical for achieving the desired accuracy (e.g., Friedman et al. (2001); Wei et al. (2019)). However, an evolving line of work in machine learning provides empirical evidence suggesting, to the surprise of many statisticians, that over-parameterization is not necessarily harmful. Indeed, many machine learning models (such as random forests or deep neural networks) are trained until the training error vanishes -- meaning that they perfectly interpolate the data -- while still generalizing well (e.g., Zhang et al. (2021); Wyner et al. (2017); Belkin et al. (2019)). As a key observation explaining this phenomenon, many models trained by gradient-type methods (e.g., gradient descent, stochastic gradient descent, AdaBoost) converge to certain minimum-norm interpolators, which implicitly favor models with smaller complexity. These empirical mysteries have inspired a recent flurry of activity towards understanding the generalization properties of various interpolators.
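The minimum $\ell_1$-norm interpolator in the title can be computed directly as a linear program: minimize $\|b\|_1$ subject to $Xb = y$, via the standard split $b = u - v$ with $u, v \geq 0$. A minimal sketch (illustrative sizes; requires scipy):

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(5)
n, p = 50, 200                                     # overparameterized: p > n
X = rng.standard_normal((n, p))
y = X @ np.r_[np.ones(5), np.zeros(p - 5)] + 0.1 * rng.standard_normal(n)

c = np.ones(2 * p)                                 # objective: sum(u) + sum(v) = ||b||_1
A_eq = np.hstack([X, -X])                          # constraint: X(u - v) = y
res = linprog(c, A_eq=A_eq, b_eq=y, bounds=[(0, None)] * (2 * p), method="highs")
b_hat = res.x[:p] - res.x[p:]
print(np.abs(X @ b_hat - y).max(), np.abs(b_hat).sum())   # interpolation residual, l1 norm
```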
Interpretable Additive Recurrent Neural Networks For Multivariate Clinical Time Series
Rahman, Asif, Chang, Yale, Rubin, Jonathan
Time series models based on recurrent neural networks (RNNs) can achieve high accuracy but are unfortunately difficult to interpret as a result of feature interactions, temporal interactions, and non-linear transformations. Interpretability is important in domains like healthcare, where models that provide insight into the relationships they have learned are required to validate and trust model predictions. We want accurate time series models in which users can understand the contribution of individual input features. We present the Interpretable-RNN (I-RNN), which balances model complexity and accuracy by forcing the relationships between variables in the model to be additive. Interactions between the hidden states of the RNN are restricted, and the hidden states are combined additively at the final step. I-RNN specifically captures the unique characteristics of clinical time series, which are unevenly sampled in time, asynchronously acquired, and have missing data. Importantly, the hidden state activations represent feature coefficients that correlate with the prediction target and can be visualized as risk curves capturing the global relationship between individual input features and the outcome. We evaluate the I-RNN model on the Physionet 2012 Challenge dataset to predict in-hospital mortality, and on a real-world clinical decision support task: predicting hemodynamic interventions in the intensive care unit. I-RNN provides explanations in the form of global and local feature importances comparable to highly intelligible models like decision trees trained on hand-engineered features, while significantly outperforming them. I-RNN remains intelligible while providing accuracy comparable to state-of-the-art decay-based and interpolation-based recurrent time series models. The experimental results on real-world clinical datasets refute the myth that there is a tradeoff between accuracy and interpretability.
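A minimal sketch of the additive idea described above (not the authors' release): each input feature is processed by its own small recurrent module, and the per-feature contributions are combined additively only at the final step, so each feature's contribution can be read off separately. The module sizes are arbitrary; requires PyTorch.

```python
import torch
import torch.nn as nn

class AdditiveRNN(nn.Module):
    def __init__(self, n_features, hidden=8):
        super().__init__()
        # One small GRU and one scalar head per input feature: no cross-feature interactions.
        self.rnns = nn.ModuleList(nn.GRU(1, hidden, batch_first=True) for _ in range(n_features))
        self.heads = nn.ModuleList(nn.Linear(hidden, 1) for _ in range(n_features))

    def forward(self, x):                        # x: (batch, time, n_features)
        contribs = []
        for j, (rnn, head) in enumerate(zip(self.rnns, self.heads)):
            _, h = rnn(x[:, :, j:j + 1])         # run feature j through its own GRU
            contribs.append(head(h[-1]))         # scalar contribution of feature j
        contribs = torch.cat(contribs, dim=1)    # (batch, n_features)
        return contribs.sum(dim=1), contribs     # additive prediction and per-feature terms

model = AdditiveRNN(n_features=6)
logit, per_feature = model(torch.randn(4, 20, 6))
```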
Are Deep Neural Networks Dramatically Overfitted?
If you are, like me, confused by why deep neural networks can generalize to out-of-sample data points without drastic overfitting, keep on reading. If you are like me, entering the field of deep learning with experience in traditional machine learning, you may often ponder this question: since a typical deep neural network has so many parameters and its training error can easily be driven to zero, it should surely suffer from substantial overfitting. How could it ever generalize to out-of-sample data points? The effort to understand why deep neural networks can generalize reminds me of an interesting paper on systems biology -- "Can a biologist fix a radio?" (Lazebnik, 2002). If a biologist tried to fix a radio the way she works on a biological system, life could be hard: because the full mechanism of the radio is not revealed, poking at small local functionalities might give some hints, but it can hardly reveal all the interactions within the system, let alone the entire working flow. Whether or not you think it is relevant to DL, it is a very fun read.
Surprises in High-Dimensional Ridgeless Least Squares Interpolation
Hastie, Trevor, Montanari, Andrea, Rosset, Saharon, Tibshirani, Ryan J.
Modern deep learning models involve a huge number of parameters. In nearly all applications of these models, current practice suggests that we should design the network to be sufficiently complex so that the model (as trained, typically, by gradient descent) interpolates the data, i.e., achieves zero training error. Indeed, in a thought-provoking experiment, Zhang et al. (2016) showed that state-of-the-art deep neural network architectures can be trained to interpolate the data even when the actual labels are replaced by entirely random ones. Despite their enormous complexity, deep neural networks are frequently seen to generalize well in meaningful practical problems. At first sight, this seems to defy conventional statistical wisdom: interpolation (vanishing training error) is usually taken to be a proxy for overfitting or poor generalization (a large gap between training and test error). In an insightful series of papers, Belkin et al. (2018b,c,a) pointed out that these concepts are, in general, distinct, and interpolation does not contradict generalization. For example, kernel ridge regression is a relatively well-understood setting in which interpolation can coexist with good generalization (Liang and Rakhlin, 2018). In this paper, we examine the prediction risk of minimum $\ell_2$-norm, or "ridgeless," least squares regression.
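For the isotropic, well-specified linear model, the limiting out-of-sample risk of the min-$\ell_2$-norm estimator as $p/n \to \gamma$ takes a simple piecewise form (stated here from memory of the paper's isotropic result, with signal strength $r^2 = \|\beta\|_2^2$ and noise level $\sigma^2$; consult the paper for the precise assumptions):
$$
R(\gamma) \;=\;
\begin{cases}
\sigma^2 \dfrac{\gamma}{1-\gamma}, & \gamma < 1,\\[6pt]
r^2\left(1-\dfrac{1}{\gamma}\right) + \sigma^2 \dfrac{1}{\gamma-1}, & \gamma > 1,
\end{cases}
$$
so the risk diverges at the interpolation threshold $\gamma = 1$ and descends again in the overparameterized regime $\gamma > 1$.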
Two models of double descent for weak features
Belkin, Mikhail, Hsu, Daniel, Xu, Ji
The "double descent" risk curve was proposed by Belkin, Hsu, Ma, and Mandal [Bel 18] to qualitatively describe the out-of-sample prediction performance of several variably-parameterized machine learning models. This risk curve reconciles the classical bias-variance tradeoff with the behavior of predictive models that interpolate training data, as observed for several model families (including neural networks) in a wide variety of applications [BO98; AS17; Spi 18; Bel 18]. In these studies, a predictive model with p parameters is fit to a training sample of size n, and the test risk (i.e., out-of-sample error) is examined as a function of p. When p is below the sample size n, the test risk is governed by the usual bias-variance decomposition. As p is increased towards n, the training risk (i.e., in-sample error) is driven to zero, but the test risk shoots up towards infinity.